Cephalometric landmark detection without X-rays combining coordinate regression and heatmap regression

Fully automated techniques using convolutional neural networks for cephalometric landmark detection have recently advanced. However, all existing studies have adopted X-rays. The problem of direct exposure of patients to X-ray radiation remains unsolved. We propose a model for detecting cephalometric landmarks using only facial profile images without X-rays. First, the model estimates the landmark coordinates using the features of facial profile images through high-resolution representation learning. Second, considering the spatial relationship of the landmarks, the model refines the estimated coordinates. The estimated coordinates are input into fully connected networks to improve the accuracy. During the experiment, a total of 2000 facial profile images collected from 2000 female patients were used. Experiments results suggested that the proposed method may perform at a level equal to or potentially better than existing methods using cephalograms. We obtained an MRE of 0.61 mm for the test data and a mean detection rate of 98.20% within 2 mm. Our proposed two-stage learning method enables a highly accurate estimation of the landmark positions using only facial profile images. The results indicate that X-rays may not be required when detecting cephalometric landmarks.


Materials and methods
The study protocol was approved by the Institutional Review Board (IRB) of our institutes (Tokyo Dental College, No. 1091, 2021-12-17).All methods were carried out in accordance with the Helsinki Declaration principles and relevant guidelines and regulations.Informed consent was obtained from all the patients, and written informed consent was obtained from three patients, including the facial images, allowing us to use their information/ images in an online open-access publication.
Figure 1 presents an overview of the proposed method.The proposed method comprises two stages: HRNetV2 31 and MLP.Unlike with existing studies, we used the facial profile images as input instead of a cephalogram.First, all landmark locations in the input images are estimated through heatmap regression using HRNetV2.In this step, the model learns the relationship between the local features of the image and landmarks.As described later in Ablation Study section, the accuracy of the estimation is insufficient, however.Therefore, we introduced a coordinate regression by MLP, which significantly improved the accuracy of landmark estimation.MLP was adopted to estimate the spatial location of the landmarks, i.e., their relative location.This two-stage approach, coarse estimation using heatmap regression and a fine estimation using MLP, enables the accurate detection of all landmarks.

Dataset
A total of 2000 lateral cephalograms and profile photograph images collected from 2000 female patients aged 14 to 69 years (mean age of 27 years) were provided by the Tokyo Dental College.These images were acquired in jpeg format using a CX-150S (ASAHIROENTGEN, Tokyo, Japan) and an EOS Kiss X90 (Canon, Tokyo, Japan).The standard for taking lateral cephalogram is set worldwide.The film should be kept parallel to the mid-sagittal plane of the head and set the head with ear rods so that the center line of the X-ray beam passes through the axes of the left and right ear rods.The distances from the X-ray tube to the mid-sagittal plane of the head and from the mid-sagittal plane of the head to the film is assumed to be 150 cm and 15 cm, respectively.The method of taking a profile photograph image is the same as that of a lateral cephalogram.That is, the distance from the camera to the midsagittal plane of the head is 150 cm, and the head is fixed with ear rods.The skeletal information obtained from the lateral cephalograms was superimposed on the profile photograph images using Quick Ceph Studio (ver.5.0.2,Quick Ceph Systems, San Diego, California).The 23 landmarks illustrated in Fig. 2 were manually plotted by two orthodontists and finally checked by an experienced orthodontist.The two of them are the board members of Orthodontic specialists.These 23 landmarks were selected from the measurement  www.nature.com/scientificreports/points of the Downs method, Northwestern method, and a Ricketts analysis.Coordinates of landmarks on the lateral image were identified by imglab, an annotation tool included with Dlib 35 .Each image has 1200 × 1200 pixels and the pixel spacing for the device was 0.35.We present the averaged results of the experiment based on fivefold cross-validation, each containing 400 test and 1600 training profile photograph images, respectively.

Superimposition of the cephalometric tracing on the facial profile image
Figure 3 shows the procedure for plotting landmarks and superimposing the cephalometric tracing on the facial profile image conducted as follows: 1. Plot 23 landmarks and trace the profile (from forehead to upper lip) on the cephalogram within the computer screen of Orthodontic analysis software (Quick Ceph Studio).2. Separate the set of the landmark plots and the cephalometric tracing on the screen.3. Superimpose the cephalometric tracing on the facial profile image by manually matching the line from nose to upper lip of the tracing with that of the facial profile image on another screen.
Five sets of the landmark plots and the cephalometric tracing created in step 1 were randomly selected and the same set was superimposed on the facial profile image by each observer according to steps 2 and 3. We measured the intra-and inter-observer landmark distance errors for five landmarks (Sella, Porion, Menton, Gonion, Basion) to confirm the accuracy and reliability of the superimposition.The reason of selecting the five landmarks, it can be considered that the further away from the profile tracing line the larger the error because the superimposition is conducted based on the profile tracing line as shown in step 3. Therefore, the five points farthest from the profile tracing line were selected.To evaluate intra-observer variability, superimpositions were conducted by one orthodontist three times with an interval of two weeks.To evaluate inter-observer variability, superimpositions were conducted by three orthodontists.The two of them are the board member of Orthodontic specialist.The mean and standard deviation of the intra-and inter-observer landmark distance error and the intraclass correlation coefficient (ICC) for the landmark plots were calculated for each set with 95% confidence intervals.In calculating the ICC, the Shapiro-Wilk test was used to determine whether the data were normally distributed.All statistical evaluations were conducted by SPSS software (ver.27.0, IBM, Armok, New York).

Repeatability and reproducibility test of manual landmark plotting
To compare the repeatability and reproducibility of orthodontists' landmark prediction errors, a total of five cephalograms from five patients were randomly selected and 23 landmarks were plotted.To evaluate intraobserver variability, cephalometric landmark plotting was conducted by one orthodontist three times with an interval of two weeks.To evaluate inter-observer variability, cephalometric landmark plotting was conducted by three orthodontists.The mean and standard deviation of the intra-and inter-observer prediction error and the ICC for each landmark were calculated for each set with 95% confidence intervals.In calculating the ICC, the Shapiro-Wilk test was used to determine whether the data were normally distributed.All statistical evaluations were conducted by SPSS software.

Heatmap regression
To estimate the location of landmarks, we used HRNetV2 to generate heatmaps of the landmarks.Figure 4 shows the structure of HRNetV2 31 , which starts with a high-resolution subnetwork (stem) and adds high-to lowresolution subnetworks successively to form a stage.The multi-resolution subnetworks are connected in parallel to form a total of four stages.The second, third, and fourth stages are set up by repeating the modularized multiresolution blocks.The exchange unit, as illustrated in Fig. 5, aggregates the information of each resolution from other subnetworks 32 .Strided 3 × 3 convolutions with stride 2 and a simple nearest neighbor sampling are applied for downsampling and upsampling, respectively.The last exchange unit outputs a feature map where the lowresolution representation is upsampled and concatenated into a high-resolution representation.The heatmap is regressed from the high-resolution representation.A previous study 31 reported that the performance is enhanced by employing all resolution representations in comparison to solely using a high-resolution representation.
The ground truth heatmaps are defined as 2D gaussian functions with standard deviation of σ centered on the ground truth location of L G i : where, x , i = {1, ...N} , and N(= 23) represent the heatmap pixel, index of the landmarks, and total number of landmarks, respectively.To converge the estimated landmarks as close as possible to the ground truth landmarks, the loss function employs the L 2 loss as follows: where h P i (x; w) represents the estimated heatmap, and w denotes the weights and biases of the network.The loss function allows the network to learn the relationship between the local features and landmarks.The heatmap comprises a 2D matrix with 23 channels corresponding to the number of landmarks.The estimated coordinates L P i of each landmark are predicted by transforming from the reduced space to the original image space.We adjusted the offset of the largest value position from the largest value to the second largest value 36 .Accordingly, the estimated position can be expressed as a set of 2D coordinates {x P i , y P i } N i=1 .
(1) Illustration of how to aggregate information of multiple resolutions using an exchange unit 32 .A stride 3 × 3 convolution is adopted for downsampling.Nearest neighbor sampling followed by a 1 × 1 convolution is used for upsampling.

Coordinate regression
MLP is a refinement of coordinates estimated using spatial relationships between landmarks.It has a simple three-layer structure with an input layer, a hidden layer, and an output layer.Because this is a simple regression task, we adopted a simple MLP to reduce the possibility of an overfitting.In 2 , a linear filter was also employed to refine the landmark positions; however, only some landmarks were adjusted, and the inputs were the combination of outputs from the two models when considering spatial information.The estimated coordinates {x P i , y P i } N i=1 are divided into {x P 1 , ..., x P N } and {y P 1 , ..., y P N } .The x-and y-coordinates are input into separate models.We adopted the L 2 loss as the loss function to get closer to the ground truth landmarks by using the positional relationship of the landmarks: where x G i and y G i represent the ground truth coordinates, and x R i and y R i denote the refined coordinates.Losses in the MLP were not propagated to HRNetV2, and were learned independently.The number of training epochs for HRNetV2 and MLP differed.For each epoch used for training HRNetV2, we trained the MLP with multiple epochs, provided that we initialized the parameters only at the beginning of training.This procedure allowed us to train the entire model efficiently, while fine-tuning the MLP to match the training phase of HRNetV2.

Implementation details
We trained and carried out testing using a GeForce GTX 1080, 3.70-GHz Intel(R) Core(TM) i7-8700K CPU, and 16GB of memory.The training and testing were conducted in Pytorch.The input images were cropped and resized to 256 × 256 , according to the center positions of the boxes.
The HRNetV2 network starts with a stem comprising two strided 3 × 3 convolutions, which reduces the resolution to 1/4.As the inputs pass through the four subsequent stages, the resolution is gradually reduced by half, and the number of channels is accordingly doubled.The first stage contains four residual units, each of which was formed by a bottleneck with a width of 64.This stage is followed by a 3 × 3 convolution, which reduces the width of the feature map to 18. Thus, the number of channels for the four resolutions is 18, 36, 72, and 144, respectively.The second, third, and fourth stages contain one, four, and three multi-resolution blocks, respectively.One multi-resolution block contains four residual units.Each unit contains two 3 × 3 convolutions for each resolution and an exchange unit across resolutions.The four resolution representations from the fourth stage are concatenated and used to predict heatmaps with a width of 64, following two 1 × 1 convolutions.We trained HRNetV2 with 60 epochs and a batch size of 16.The model was pre-trained using WFLW 37 .The base learning rate was set as 0.0001, and then was reduced to 0.00001 and 0.000001 at 30 and 50 epochs, respectively.The loss function was minimized using the Adam optimizer.
The MLP network comprises three layers: input, hidden, and output layers.The number of neurons in the middle layer was set to 500.The number of inputs and outputs was set to 23 for splitting the coordinates transformed from the heatmaps and predicted by HRNetV2 into x-and y-coordinates.We trained the MLP using 100 epochs and a batch size of 16 every time HRNetV2 was trained for 1 epoch.The loss function was minimized using the Adam optimizer with a learning rate of 0.00001 and a weight decay (L2 regularization) factor of 0.0001.

Evaluation metrics
We evaluated the proposed method in terms of the mean radial error (MRE), successful detection rate (SDR), and successful classification rate (SCR), according to a previous benchmark study 13 .The radius error was defined by R = ∆x 2 + ∆y 2 , where ∆x and ∆y represent the Euclidean distances between the estimated landmarks and the ground truth landmarks of the x-and y-axes, respectively.The MRE and the standard deviation (SD) are defined as follows: where R i represents the radial error of the ith landmark, and n denotes the total number of landmarks to be detected.The SDR is the ratio of estimated landmarks that are within a reference threshold, and is defined as follows: where n d represents the number of successfully detected landmarks, and the threshold values are 2.0, 2.5, 3.0, and 4.0 mm, as typically used.The SCR is the classification accuracy of anatomical face types based on eight clinical measures (ANB, SNB, SNA, overbite depth indicator (ODI), anteroposterior dysplasia indicator (APDI), facial height index (FHI), frankfurt mandibular angle (FMA), modified wits (MW)).Facial images are classified into three anatomical types under clinical measures.Note that geometric criteria such as the angles and distances between landmarks listed in Table 1 are considered.

Accuracy and reliability of the superimposition of cephalometric tracing on the facial profile image
All distance errors were considered to follow a normal distribution by the Shapiro-Wilk test (p>0.05;not shown for details).The mean landmark distance error, standard deviation and the ICC values for each set are shown in Table 2.The mean intra-and inter-observer landmark distance error and standard deviations were 0.32 mm ± 0.07 mm and 0.41 mm ± 0.11 mm, respectively, indicating superiority over previous studies 18 .The ICC of the mean landmark distance error for each patient ranged from 0.993 to 0.998 (< 0.00, poor; 0.00-0.20,slight; 0.21-0.40,fair; 0.41-0.60,moderate; 0.61-0.80,substantial; 0.81-1.00,almost perfect) 38 , indicating a high degree of intra-and inter-observer agreement.The 95% confidence interval also supports intra-and inter-observer agreement.In this way, the accuracy and reliability of the superposition was confirmed.

Result of repeatability and reproducibility test of manual landmark plotting
All distance errors were considered to follow a normal distribution by the Shapiro-Wilk test (p>0.05;not shown for details).The mean prediction error, standard deviation and the ICC values for each landmark are shown in Table 3.The mean intra-and inter-observer prediction error and standard deviations were 0.38 mm ± 0.11 mm and 0.37 mm ± 0.11 mm, respectively, indicating superiority over previous studies 18 .The ICC of the prediction error for each landmark ranged from 0.992 to 0.999 (< 0.00, poor; 0.00-0.20,slight; 0.21-0.40,fair; 0.41-0.60,moderate; 0.61-0.80,substantial; 0.81-1.00,almost perfect) 38 , indicating a high degree of intra-and inter-observer agreement.The 95% confidence interval also supports intra-and inter-observer agreement.In this way, intra-and inter-observer agreement was confirmed.

Performance of cephalometric landmark detection for the proposed model and comparison with existing methods
Table 4 presents a comparison between the results of the proposed and existing methods.The proposed method achieves the best performance under all metrics.Because no existing studies have solely adopted facial profile images for training, we provide comparisons with a study adopting X-rays.The comparison methods were selected based on their recency and proximity to the proposed method.The performance of the proposed Table 1.Criteria for eight clinical measures of anatomical face-type classification in SCR.ANB is the angle between L5, L2, and L6.SNB is the angle between L1, L2, and L6.SNA is the angle between L1, L2, and L5.ODI is the arithmetic sum of the angle between the lines from L5 to L6 and from L8 to L10, and the angle between the lines from L3 to L4 and from L17 to L18.APDI is the arithmetic sum of the angle between the lines from L3 to L4 and from L2 to L7, the angle between the lines from L2 to L7 and from L5 to L6, and the angle between the lines from L3 to L4 and from L17 to L18.FHI is the ratio of the distance from L1 to L10 (PFH) to the distance from L2 to L8 (AFH).FMA is the angle between the line from L1 to L2 and the line from L10 to L9. MW is the distance between L12 and L11.model trained on facial profile images is presented in Table 7. Table 5 presents the MRE, SD, and SDR for each of the 23 landmarks.The MRE for all landmarks is less than 2 mm, which is clinically acceptable 6 .Among the landmarks, Basion exhibited the best MRE.However, the MRE of Sella and Mo are large, and the SDR at 2 mm is low.The results of the proposed method indicates that Orbitale, Subspinale, Pogonion, Gonion, and Articulare, which have been known to handle landmarks that are difficult to accurately estimate in previous studies 7,8,13,19 , can be estimated within 1 mm.In addition, it can be seen that the error in the y-coordinate is larger than that in the x-coordinate.Table 6 presents the SCR of the proposed method and the existing approachs.In terms of classification, the proposed method outperforms the existing methods under all metrics.The classification accuracy exceeds 90% in six out of eight metrics.Fig. 6 shows the MRE and loss of training and test according to the number of cycles.We defined one cycle as the process of training one epoch of HRNetV2 followed by 100 epochs of MLP.

Ablation study
We evaluated the contribution of MLP to the proposed method.We compared the performance of HRNetV2 31 with the performance of HRNetV2 combined with MLP.Table 7 presents the MRE, SD, and SDR for each model.Accordingly, it was deduced that the application of MLP significantly improves the performance.This  indicates that heatmap regression, followed by coordinate regression using MLP, works effectively.The MRE for each landmark is presented in Fig. 7.The accuracies for Sella, Porion, Gonion, Basion, and Articulare, where HRNetV2 exhibits a poor accuracy, were also significantly improved.Table 8 presents the SCR for each.From this Table, it is evident that the accuracy is significantly improved, except for MW, which is consistently high.Figure 8 presents a visualization of the estimations by HRNetV2 and HRNetV2 + MLP, including the ground truth locations of the landmarks.

Discussion
The proposed method is novel because it estimates the location of cephalometric landmarks without lateral cephalograms.Table 7 shows that the MRE of the proposed method is 0.61 mm.While a direct comparison is difficult due to differences in datasets, the proposed method demonstrates higher performance compared to the previously reported error 18 between expert clinicians and existing methods.The experimental results suggest that the proposed method may perform at a level equal to or potentially better than existing methods using cephalograms.Although a manual analysis is limited by inter-and intra-person errors, the proposed model achieves an SD of 1 mm for all landmarks except for Sella.This implies that the proposed model can provide a stable estimation with small deviations.Furthermore, the proposed method achieves a significant SDR and is free www.nature.com/scientificreports/from the effects of extreme outliers 18,19 which have limited several previous studies.Therefore, a cephalometric analysis using AI may replace a human analysis in the future.The accuracy is also high for landmarks that have been considered difficult to estimate accurately in existing studies.Figure 7 indicates that the estimations of Gonion and Articulare by HRNetV2 are still inaccurate.This may be due to the influence of the ear rod used to fix the head position, which is commonly included in both lateral cephalograms and facial profile images.This supports the idea that CNN-based heatmap regression methods learn each landmark independently.In particular, HRNetV2 strictly learns the positional relationship of each landmark with the surrounding features owing to the parallel distributed processing with many convolutional layers applied.Following the estimation with HRNetV2, the refinement of the landmark locations using MLP contributes to an improvement of the estimation accuracy of all landmarks.It is difficult to learn intricate positional relationships using only MLP.Note that the estimation in HRnetV2 described in the previous section may include the potential structural relationships among landmarks.Because MLP is fully connected from the input layer to the output layer and the input location information is processed comprehensively, it is possible to explicitly learn this potential positional relationship.This should allow MLP to incorporate the structural     www.nature.com/scientificreports/accuracy.In addition, enhancing the heatmap regression procedure can further enhance the overall performance.We can introduce adaptive wing loss 34 or face boundary prediction 33 , which achieves a high accuracy in facial landmark detection.As demonstrated in Table 5, Sella, and Mo have a larger MRE and lower SDR at 2 mm than the other landmarks.In particular, Sella has a larger error in the y-direction.Two reasons can be given for the estimation difficulty: First, Sella is the highest position among the landmarks to be estimated in most cases, and its location is far from the jaw area where the landmarks are densely located.Second, Mo has large errors in both the x-and y-directions.It appears that the variation of the position among patients biased the model.The error tends to be larger in the y-direction than in the x-direction.The shape of the face may influence the estimation.Most of the patients in this study have long faces, and landmarks are scattered in the y-direction.This may explain why the estimation error in the y-direction is larger than that in the x-direction.
Table 6 implies that the proposed method achieves a significant SCR.Because the SCR is calculated using linear and angular measurements, it is not influenced by the pixel spacing.Therefore, we can make reliable comparisons, even with different datasets.Note that the dataset used in this study did not include patients with a malocclusion.However, a clinical evaluation requires the inclusion of such data.
The performance of the proposed method tends to depend on the size and diversity of the training data.Previous studies 39 have reported that the accuracy of the proposed method increases with an increase in the number of data.Hence, adding more data is the easiest way to improve the system.However, this study also includes the risk of bias regarding the validation discussed in Schwendicke et al. 40 .Because the MLP approach is modelagnostic, there is no restriction on the heatmap regression model used for the initial estimation.The proposed method can reconcile the local features of images with the spatial relationship between landmarks.Hence, it may be widely used for landmark detection tasks such as facial landmark detection and pose estimation.In the future, it will be necessary to address 3D landmark detection.

Conclusion
In this paper, we proposed a novel cephalometric landmark detection method without the use of X-rays.The proposed framework combines the CNN-based heatmap regression model (HRNetV2) with a coordinate regression model (MLP).HRNetV2 estimates the location of landmarks by learning the relationship between local features and landmarks.However, it is limited by the same problem as conventional heatmap regression methods, and its accuracy is insufficient.The MLP following HRNetV2 can learn the spatial positional relationship between landmarks, which significantly improves the accuracy of the estimation.In experiments conducted with the created dataset, the proposed method performed remarkably with an MRE of 0.61 mm and a detection rate within 2.0 mm of 98.20%.The proposed method also achieved a high accuracy in terms of anatomical face-type classification, where a reliable comparison between different datasets is possible.In the future, in addition to replacing existing methods using X-rays, the proposed method may also replace measurements by humans.Because the proposed landmark refinement using MLP is model-agnostic, it can be combined with conventional methods.Furthermore, it has the potential to be a prominent approach applicable to various landmark detection tasks and provide significant improvements in performance.

Figure 1 .
Figure 1.Overview of the proposed model.The model is divided into HRNetV2 and MLP.After HRNetV2 applies the heatmap regression based on the feature extraction, MLP estimates the coordinates based on the spatial relationship of the landmarks.

Figure 3 .
Figure 3. Procedure for plotting landmarks and superimposing the cephalometric tracing on the facial profile image.

Figure 6 .
Figure 6.Performances of deep convolutional neural network-based AI model.(a) MRE of training and test according to the number of cycles.(b) HRNetV2 Loss of training and test according to the number of cycles.(c) MLP Loss of training and test to the number of cycles.Cycle means the process of training one epoch of HRNetV2 followed by 100 epochs of MLP.

Figure 7 .
Figure 7. MRE for each landmark, with and without MLP.

Figure 8 .
Figure 8. Visualization of landmarks, with and without MLP.(a,b) The best and worst result of the model with MLP, respectively.Blue and green dots show the estimations by HRNetV2 and HRNetV2+MLP, respectively, and the red dots indicate the ground truth.

Table 3 .
Intra-and inter-observer mean prediction error, standard deviation and the ICC values with 95% confidence intervals for 23 landmarks.

Table 4 .
13mparison of MRE and SDR for the automated cephalometric analysis systems.All methods were trained by the ISBI2015 dataset13.The averages using test1 and test2 are reported.Significant values are in bold.

Table 5 .
MRE, SD, and SDR for each landmark.

Table 6 .
Comparison of SCR for the automated cephalometric analysis systems.Numbers are given in percentages (%).Significant values are in bold.

Table 7 .
Performance of the method and comparison of MRE, SD, and SDR, with and without MLP.

Table 8 .
SCR comparison, with and without MLP.